System and method for assignment of unique identifiers in a distributed environment

ABSTRACT

System and method for assignment of unique identifiers in a distributed environment. In a distributed system having a plurality of nodes, a method for allocating identifiers comprises steps of: allocating a pool of identifiers for use; maintaining lists of free identifiers in the pool at participating nodes; obtaining at a first node permission to update the lists of free identifiers; upon receiving permission to update the lists, allocating for the first node a set of identifiers from the lists; updating the lists of free identifiers to reflect allocation of the set of identifiers for the first node; sending the updated lists from the first node to other participating nodes; upon receiving the updated lists of free identifiers at each other participating node, updating each other participating node&#39;s respective copy of the lists of free identifiers; and relinquishing the first node&#39;s permission to update the lists of free identifiers.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is related to and claims the benefit of priorityof the following commonly-owned, presently-pending provisionalapplication(s): application Ser. No. 60/767,522, filed May 10, 2006,entitled “System and Method for Assignment of Unique Identifiers in aDistributed Environment”, of which the present application is anon-provisional application thereof. The disclosure of the foregoingapplication is hereby incorporated by reference in its entirety,including any appendices or attachments thereof, for all purposes.

COPYRIGHT STATEMENT

A portion of the disclosure of this patent document contains materialwhich is subject to copyright protection. The copyright owner has noobjection to the facsimile reproduction by anyone of the patent documentor the patent disclosure as it appears in the Patent and TrademarkOffice patent file or records, but otherwise reserves all copyrightrights whatsoever.

APPENDIX DATA

Computer Program Listing Appendix Under Sec. 1.52(e):

This application includes a transmittal under 37 C.F.R. Sec. 1.52(e) ofa Computer Program Listing Appendix. The Appendix, which comprises textfile(s) that are IBM-PC machine and Microsoft Windows Operating Systemcompatible, includes the below-listed file(s). All of the materialdisclosed in the Computer Program Listing Appendix can be found at theU.S. Patent and Trademark Office archives and is hereby incorporated byreference into the present application.

Object Description: SourceCode.txt, size: 34994 Bytes, created: Mar. 19,2007 10:05:16 AM; Object ID: File No. 1; Object Contents: Source code.

BACKGROUND OF INVENTION

1. Field of the Invention

The present invention relates generally to data processing environmentsand, more particularly, to allocation and management of uniqueidentifiers, especially in distributed database environments.

2. Description of the Background Art

Computers are very powerful tools for storing and providing access tovast amounts of information. Computer databases are a common mechanismfor storing information on computer systems while providing easy accessto users. A typical database is an organized collection of relatedinformation stored as “records” having “fields” of information. As anexample, a database of employees may have a record for each employeewhere each record contains fields designating specifics about theemployee, such as name, home address, salary, and the like.

Between the actual physical database itself (i.e., the data actuallystored on a storage device) and the users of the system, a databasemanagement system or DBMS is typically provided as a software cushion orlayer. In essence, the DBMS shields the database user from knowing oreven caring about the underlying hardware-level details. Typically, allrequests from users for access to the data are processed by the DBMS.For example, information may be added or removed from data files,information retrieved from or updated in such files, and so forth, allwithout user knowledge of the underlying system implementation. In thismanner, the DBMS provides users with a conceptual view of the databasethat is removed from the hardware level. The general construction andoperation of database management systems is well known in the art. Seee.g., Date, C., “An Introduction to Database Systems, Seventh Edition”,Part I (especially Chapters 1-4), Addison Wesley, 2000.

In any networked system, whether database or otherwise, uniqueidentifiers are needed for various purposes. In modern computingsystems, for instance, unique identifiers are often associated withsystem generated objects to clearly identify an object to the system anddistinguish any one object from others in the system. Examples includeusers, groups of users, files, and hardware components (e.g., printers),all of which have a unique identifier assigned. The generation andmaintenance of unique identifiers therefore serves a critical role indata processing environments.

In database systems, unique identifiers are needed when new objects arecreated. For instance, one typical use is for assigning uniqueidentifiers for rows in a table, that is “row IDs” (popularly known asROWIDs or RIDs) to uniquely identify a particular row. Particularlychallenging is the allocation and management of unique identifiers foruse in distributed systems. In a distributed system, all the computersin the network need unique identifiers for doing their computation;objects processed by those computers may themselves need to be uniquelyidentified across the multiple computers that comprise the distributedsystem. Thus, there is a need for allocating and managing identifiersthat are unique across the entire distributed system.

One approach to the problem is described in U.S. Pat. No. 6,457,053. Theapproach is based on a master-slave relationship as follows. A systemfor multi-master unique identifier allocation includes a server forallocating pools of identifiers to requesting servers and at least oneserver for requesting pools of identifiers and allocating individualidentifiers as necessary. A single master server allocates “pools” ofunique identifiers to network servers upon request. The network servers,in turn, allocate unique identifiers from their pool as necessary whenthe network server generates new system objects. When a network server'spool of unique identifiers is nearly depleted, the network serverrequests an additional pool of identifiers from the master server.

However, as the approach is based on a master-slave relationship, it isnot fully distributed (i.e., it requires centralized management). Thismay lead to problems and limitations, including inefficient messagingand inferior scalability. As one example, the reliance on centralizedmanagement has limitations and inefficiencies in handling the additionand deletion of server nodes of the distributed system during ongoingoperations, particularly in the case of failure (e.g., crash) of one ormore of the server nodes. For instance, if the master server fails or isshut down, another server must take over the role of master, includingobtaining the information necessary to serve in the master role. Also,in the event that one of the “slave” nodes (i.e., a network server otherthan the master) fails or is shut down, unique identifiers assigned tosuch node may be lost (i.e., no longer available) for use by the otherservers. In view of limitations and inefficiencies such as these, abetter solution is sought. In particular, what is needed is a systemincorporating a methodology of assigning these unique identifiers in amanner which is not centralized and may therefore fully participate inall of the features and benefits of distributed environments, includingimproved scalability. Ideally, the solution should gracefully andefficiently handle the addition and removal of networked server nodesfrom the cluster, including those resulting from failure of one or moreof the nodes. The present invention fulfills this and other needs.

SUMMARY OF INVENTION

System and method for assignment of unique identifiers in a distributedenvironment is described. In one embodiment, for example, in adistributed system having a plurality of nodes, a method of the presentinvention is described for allocating identifiers for use at nodes ofthe distributed system, the method comprises steps of: allocating a poolof identifiers for use in the system; maintaining lists of freeidentifiers in the pool at participating nodes in the system; obtainingat a first node permission to update the lists of free identifiers; uponreceiving permission to update the lists, allocating for the first nodea set of identifiers from the lists of free identifiers; updating thelists of free identifiers to reflect allocation of the set ofidentifiers for the first node; sending the updated lists of freeidentifiers from the first node to other participating nodes; uponreceiving the updated lists of free identifiers at each otherparticipating node, updating each other participating node's respectivecopy of the lists of free identifiers; and relinquishing the firstnode's permission to update the lists of free identifiers.

In another embodiment, for example, a system of the present inventionfor allocating identifiers for use at a plurality of database servernodes sharing access to a database is described that comprises: adatabase; a plurality of database server nodes connected to each othervia a network and sharing access to the database; an identity datastructure at each database server node for maintaining information aboutidentifiers allocated for use among the plurality of database servernodes; a distributed lock for regulating access to the identity datastructure at the plurality of server nodes; and an identity manager,responsive to a request for identifiers at a given node, for acquiringthe distributed lock on the identity data structure, allocating a set ofidentifiers to the given node upon acquiring the distributed lock,updating the identity data structure at the plurality of database servernodes to reflect allocation of the set of identifiers for the givennode, and relinquishing the distributed lock after completion of updatesto the identity data structure at all database server nodes.

In yet another embodiment, for example, in a distributed database systemhaving a plurality of nodes sharing access to a database, a method ofthe present invention is described for allocating identifiers for use atnodes of the distributed system, the method comprises steps of:allocating a pool of identifiers from the database for use among theplurality of nodes of the distributed database system; maintaininginformation about the pool of identifiers at each of the nodes in anidentity data structure; in response to a request for identifiers at agiven node, providing identifiers from a set of identifiers previouslyallocated for use at the given node while identifiers remain in the setallocated for the given node; when no identifiers remain in the set,obtaining additional identifiers for use at the given node by performingsubsteps of: acquiring a lock on the identity data structure at each ofthe nodes; allocating identifiers available in the pool for use at thegiven node; updating the identity data structure to reflect allocationof the identifiers for the given node; and upon updating the identitydata structure at all nodes, relinquishing the lock.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a very general block diagram of a computer system (e.g., anIBM-compatible system) in which software-implemented processes of thepresent invention may be embodied.

FIG. 2 illustrates the general structure of a client/server databasesystem suitable for implementing the present invention.

FIG. 3 is a high-level block diagram of a Shared Disk Cluster databasesystem environment illustrating an example of a four node Shared DiskCluster accessed by a plurality of clients.

FIG. 4 is a high level block diagram illustrating a distributed systemenvironment in which the system and methodology of the present inventionmay be implemented.

FIG. 5A comprises a flowchart illustrating methodology for allocation ofidentifiers to tasks at a given node.

FIG. 5B comprises a flowchart illustrating methodology for allocation ofidentifiers to a given node in response to demand for identifiers atsuch node.

FIG. 5C comprises a flowchart illustrating methodology for an identitymanager at a given node to obtain an additional block of identifiersfrom disk for use in the cluster.

FIG. 6 is a flowchart illustrating the methodology of the presentinvention for scavenge of identifiers when a node leaves the cluster.

FIGS. 7A-B comprise a single flowchart illustrating the methodology ofthe present invention for instantiation/recovery of the identity datastructure when a cluster boots.

DETAILED DESCRIPTION Glossary

The following definitions are offered for purposes of illustration, notlimitation, in order to assist with understanding the discussion thatfollows.

Cluster: A collection of more than one networked and usually homogeneoussystems, which function as a single system. All the instances in thecluster communicate with each other typically through privateinterconnects.

Clustered Server: A clustered server refers to a database serverinstance which runs on a Shared Disk Cluster and jointly manages asingle installation of the database on the shared disks.

DES: DES refers to table descriptor, which in the currently preferredembodiment of the present invention is stored on all instances in thecluster.

DES Scavenge: Refers to the operation of scavenging an objectdescriptor, where the in-memory object descriptor needs to be releasedand the memory reused for some other purpose.

Failover Handling: The recovery of a node which failed in the cluster,by another node in the cluster.

Identity Column: A column of a table which has the property “IDENTITY”and which is populated automatically with a new value every time a rowis inserted into the table. This new identity value is internallygenerated and is guaranteed to be different from all the previouslygenerated identity values for that table.

Identity value or identifier: a value used to populate the identitycolumn of a table. In the currently preferred embodiment of the presentinvention, identity values are unique, but are not required to bemonotonically increasing.

Network: A network is a group of two or more systems linked together.There are many types of computer networks, including local area networks(LANs), virtual private networks (VPNs), metropolitan area networks(MANs), campus area networks (CANs), and wide area networks (WANs)including the Internet. As used herein, the term “network” refersbroadly to any group of two or more computer systems or devices that arelinked together from time to time (or permanently).

Node or Server Instance: A clustered server in a Shared Disk Cluster.

Node (Instance) Failover: refers to the possibility of an instance inthe cluster going down due to software or hardware failure, resulting ina surviving coordinator instance taking action to recover the instance.

OCM lock: A distributed lock which serves as the exclusive right toupdate a data value in a distributed networked system.

PSS: PSS refers to a task descriptor.

Relational database: A relational database is a collection of data itemsorganized as a set of formally-described tables from which data can beaccessed or reassembled in many different ways without having toreorganize the database tables. The relational database was invented byE. F. Codd at IBM in 1970. A relational database employs a set of tablescontaining data fitted into predefined categories. Each table (which issometimes called a relation) contains one or more data categories incolumns. The standard user and application program interface to arelational database is the structured query language (SQL), definedbelow.

Shared Disk Cluster: Refers to a cluster system where multiple databaseserver instances on two or more machines manage the same disk image ofthe database, by having shared access to disk storage.

Symmetric Multiprocessing: Symmetric Multiprocessing (SMP) is a computerarchitecture that provides fast performance by making multiple CPUsavailable to complete individual processes simultaneously(multiprocessing). Unlike asymmetrical processing, any idle processorcan be assigned any task, and additional CPUs can be added to improveperformance and handle increased loads. SMP is used herein as shorthandfor a single instance of a server running on a multi-CPU machine.

SQL: SQL stands for Structured Query Language. The original versioncalled SEQUEL (structured English query language) was designed by IBM inthe 1970's. SQL-92 (or SQL/92) is the formal standard for SQL as set outin a document published by the American National Standards Institute in1992; see e.g., “Information Technology—Database languages—SQL”,published by the American National Standards Institute as AmericanNational Standard ANSI/ISO/IEC 9075: 1992, the disclosure of which ishereby incorporated by reference. SQL-92 was superseded by SQL-99 (orSQL3) in 1999; see e.g., “Information Technology—Database Languages—SQL,Parts 1-5” published by the American National Standards Institute asAmerican National Standard INCITS/ISO/IEC 9075-(1-5)-1999 (formerlyANSI/ISO/IEC 9075-(1-5) 1999), the disclosure of which is herebyincorporated by reference.

Introduction

Referring to the figures, exemplary embodiments of the invention willnow be described. The following description will focus on the presentlypreferred embodiment of the present invention, which is implemented indesktop and/or server software (e.g., driver, application, or the like)operating in an Internet-connected environment running under anoperating system, such as the Microsoft Windows operating system. Thepresent invention, however, is not limited to any one particularapplication or any particular environment. Instead, those skilled in theart will find that the system and methods of the present invention maybe advantageously embodied on a variety of different platforms,including Macintosh, Linux, Solaris, UNIX, FreeBSD, and the like.Therefore, the description of the exemplary embodiments that follows isfor purposes of illustration and not limitation. The exemplaryembodiments are primarily described with reference to block diagrams orflowcharts. As to the flowcharts, each block within the flowchartsrepresents both a method step and an apparatus element for performingthe method step. Depending upon the implementation, the correspondingapparatus element may be configured in hardware, software, firmware, orcombinations thereof.

Computer-Based Implementation

Basic System Hardware and Software (e.g., for Desktop and ServerComputers)

The present invention may be implemented on a conventional orgeneral-purpose computer system, such as an IBM-compatible personalcomputer (PC) or server computer. FIG. 1 is a very general block diagramof a computer system (e.g., an IBM-compatible system) in whichsoftware-implemented processes of the present invention may be embodied.As shown, system 100 comprises a central processing unit(s) (CPU) orprocessor(s) 101 coupled to a random-access memory (RAM) 102, aread-only memory (ROM) 103, a keyboard 106, a printer 107, a pointingdevice 108, a display or video adapter 104 connected to a display device105, a removable (mass) storage device 115 (e.g., floppy disk, CD-ROM,CD-R, CD-RW, DVD, or the like), a fixed (mass) storage device 116 (e.g.,hard disk), a communication (COMM) port(s) or interface(s) 110, a modem112, and a network interface card (NIC) or controller 111 (e.g.,Ethernet). Although not shown separately, a real time system clock isincluded with the system 100, in a conventional manner.

CPU 101 comprises a processor of the Intel Pentium family ofmicroprocessors. However, any other suitable processor may be utilizedfor implementing the present invention. The CPU 101 communicates withother components of the system via a bi-directional system bus(including any necessary input/output (I/O) controller circuitry andother “glue” logic). The bus, which includes address lines foraddressing system memory, provides data transfer between and among thevarious components. Description of Pentium-class microprocessors andtheir instruction set, bus architecture, and control lines is availablefrom Intel Corporation of Santa Clara, Calif. Random-access memory 102serves as the working memory for the CPU 101. In a typicalconfiguration, RAM of sixty-four megabytes or more is employed. More orless memory may be used without departing from the scope of the presentinvention. The read-only memory (ROM) 103 contains the basicinput/output system code (BIOS)—a set of low-level routines in the ROMthat application programs and the operating systems can use to interactwith the hardware, including reading characters from the keyboard,outputting characters to printers, and so forth.

Mass storage devices 115, 116 provide persistent storage on fixed andremovable media, such as magnetic, optical or magnetic-optical storagesystems, flash memory, or any other available mass storage technology.The mass storage may be shared on a network, or it may be a dedicatedmass storage. As shown in FIG. 1, fixed storage 116 stores a body ofprogram and data for directing operation of the computer system,including an operating system, user application programs, driver andother support files, as well as other data files of all sorts.Typically, the fixed storage 116 serves as the main hard disk for thesystem.

In basic operation, program logic (including that which implementsmethodology of the present invention described below) is loaded from theremovable storage 115 or fixed storage 116 into the main (RAM) memory102, for execution by the CPU 101. During operation of the programlogic, the system 100 accepts user input from a keyboard 106 andpointing device 108, as well as speech-based input from a voicerecognition system (not shown). The keyboard 106 permits selection ofapplication programs, entry of keyboard-based input or data, andselection and manipulation of individual data objects displayed on thescreen or display device 105. Likewise, the pointing device 108, such asa mouse, track ball, pen device, or the like, permits selection andmanipulation of objects on the display device. In this manner, theseinput devices support manual user input for any process running on thesystem.

The computer system 100 displays text and/or graphic images and otherdata on the display device 105. The video adapter 104, which isinterposed between the display 105 and the system's bus, drives thedisplay device 105. The video adapter 104, which includes video memoryaccessible to the CPU 101, provides circuitry that converts pixel datastored in the video memory to a raster signal suitable for use by acathode ray tube (CRT) raster or liquid crystal display (LCD) monitor. Ahard copy of the displayed information, or other information within thesystem 100, may be obtained from the printer 107, or other outputdevice. Printer 107 may include, for instance, a HP Laserjet printer(available from Hewlett Packard of Palo Alto, Calif.), for creating hardcopy images of output of the system.

The system itself communicates with other devices (e.g., othercomputers) via the network interface card (NIC) 111 connected to anetwork (e.g., Ethernet network, Bluetooth wireless network, or thelike), and/or modem 112 (e.g., 56K baud, ISDN, DSL, or cable modem),examples of which are available from 3Com of Santa Clara, Calif. Thesystem 100 may also communicate with local occasionally-connecteddevices (e.g., serial cable-linked devices) via the communication (COMM)interface 110, which may include a RS-232 serial port, a UniversalSerial Bus (USB) interface, or the like. Devices that will be commonlyconnected locally to the interface 110 include laptop computers,handheld organizers, digital cameras, and the like.

IBM-compatible personal computers and server computers are availablefrom a variety of vendors. Representative vendors include Dell Computersof Round Rock, Tex., Hewlett-Packard of Palo Alto, Calif., and IBM ofArmonk, N.Y. Other suitable computers include Apple-compatible computers(e.g., Macintosh), which are available from Apple Computer of Cupertino,Calif., and Sun Solaris workstations, which are available from SunMicrosystems of Mountain View, Calif.

A software system is typically provided for controlling the operation ofthe computer system 100. The software system, which is usually stored insystem memory (RAM) 102 and on fixed storage (e.g., hard disk) 116,includes a kernel or operating system (OS) which manages low-levelaspects of computer operation, including managing execution ofprocesses, memory allocation, file input and output (I/O), and deviceI/O. The OS can be provided by a conventional operating system,Microsoft Windows NT, Microsoft Windows 2000, Microsoft Windows XP, orMicrosoft Windows Vista (Microsoft Corporation of Redmond, Wash.) or analternative operating system, such as the previously mentioned operatingsystems. Typically, the OS operates in conjunction with device drivers(e.g., “Winsock” driver—Windows' implementation of a TCP/IP stack) andthe system BIOS microcode (i.e., ROM-based microcode), particularly wheninterfacing with peripheral devices. One or more application(s), such asclient application software or “programs” (i.e., set ofprocessor-executable instructions), may also be provided for executionby the computer system 100. The application(s) or other softwareintended for use on the computer system may be “loaded” into memory 102from fixed storage 116 or may be downloaded from an Internet location(e.g., Web server). A graphical user interface (GUI) is generallyprovided for receiving user commands and data in a graphical (e.g.,“point-and-click”) fashion. These inputs, in turn, may be acted upon bythe computer system in accordance with instructions from OS and/orapplication(s). The graphical user interface also serves to display theresults of operation from the OS and application(s).

Client-Server Database Management System

While the present invention may operate within a single (standalone)computer (e.g., system 100 of FIG. 1), the present invention ispreferably embodied in a multi-user computer system, such as aclient/server system. FIG. 2 illustrates the general structure of aclient/server database system 200 suitable for implementing the presentinvention. (Specific modifications to the system 200 for implementingmethodologies of the present invention are described in subsequentsections below.) As shown, the system 200 comprises one or moreclient(s) 210 connected to a server 230 via a network 220. Specifically,the client(s) 210 comprise one or more standalone terminals 211connected to a database server system 240 using a conventional network.In an exemplary embodiment, the terminals 211 may themselves comprise aplurality of standalone workstations, dumb terminals, or the like, orcomprise personal computers (PCs) such as the above-described system100. Typically, such units would operate under a client operatingsystem, such as a Microsoft® Windows client operating system (e.g.,Microsoft® Windows 95/98, Windows 2000, or Windows XP).

The database server system 240, which comprises Sybase® Adaptive Server®Enterprise (available from Sybase, Inc. of Dublin, Calif.) in anexemplary embodiment, generally operates as an independent process(i.e., independently of the clients), running under a server operatingsystem such as Microsoft® Windows NT, Windows 2000, or Windows XP (allfrom Microsoft Corporation of Redmond, Wash.), UNIX (Novell), Solaris(Sun), or Linux (Red Hat). The network 220 may be any one of a number ofconventional network systems, including a Local Area Network (LAN) orWide Area Network (WAN), as is known in the art (e.g., using Ethernet,IBM Token Ring, or the like). The network 220 includes functionality forpackaging client calls in the well-known Structured Query Language (SQL)together with any parameter information into a format (of one or morepackets) suitable for transmission to the database server system 240.

Client/server environments, database servers, and networks are welldocumented in the technical, trade, and patent literature. For adiscussion of Sybase®-branded database servers and client/serverenvironments generally, see, e.g., Nath, A., “The Guide to SQL Server”,Second Edition, Addison-Wesley Publishing Company, 1995. For adescription of Sybase® Adaptive Server® Enterprise, see, e.g., “AdaptiveServer Enterprise 15.0 Collection: (1) Core Documentation Set and (2)Installation and Configuration,” available from Sybase, Inc. of Dublin,Calif. This product documentation is available via the Internet (e.g.,currently at sybooks.sybase.com). The disclosures of the foregoing arehereby incorporated by reference.

In operation, the client(s) 210 store data in, or retrieve data from,one or more database tables 250, as shown at FIG. 2. Data in arelational database is stored as a series of tables, also calledrelations. Typically resident on the server 230, each table itselfcomprises one or more “rows” or “records” (tuples) (e.g., row 255 asshown at FIG. 2). A typical database will contain many tables, each ofwhich stores information about a particular type of entity. A table in atypical relational database may contain anywhere from a few rows tomillions of rows. A row is divided into fields or columns; each fieldrepresents one particular attribute of the given row. A rowcorresponding to an employee record, for example, may includeinformation about the employee's ID Number, Last Name and First Initial,Position, Date Hired, Social Security Number, and Salary. Each of thesecategories, in turn, represents a database field. In the foregoingemployee table, for example, Position is one field, Date Hired isanother, and so on. With this format, tables are easy for users tounderstand and use. Moreover, the flexibility of tables permits a userto define relationships between various items of data, as needed. Thus,a typical record includes several categories of information about anindividual person, place, or thing. Each row in a table is uniquelyidentified by a record ID (RID), which can be used as a pointer to agiven row.

Most relational databases implement a variant of the Structured QueryLanguage (SQL), which is a language allowing users and administrators tocreate, manipulate, and access data stored in the database. The syntaxof SQL is well documented; see, e.g., the above-mentioned “AnIntroduction to Database Systems”. SQL statements may be divided intotwo categories: data manipulation language (DML), used to read and writedata; and data definition language (DDL), used to describe data andmaintain the database. DML statements are also called queries. Inoperation, for example, the clients 210 issue one or more SQL commandsto the server 230. SQL commands may specify, for instance, a query forretrieving particular data (i.e., data records meeting the querycondition) from the database table(s) 250. In addition to retrieving thedata from database server table(s) 250, the clients 210 also have theability to issue commands to insert new rows of data records into thetable(s), or to update and/or delete existing records in the table(s).

SQL statements or simply “queries” must be parsed to determine an accessplan (also known as “execution plan” or “query plan”) to satisfy a givenquery. In operation, the SQL statements received from the client(s) 210(via network 220) are processed by the engine 260 of the database serversystem 240. The engine 260 itself comprises a parser 261, a normalizer263, a compiler 265, an execution unit 269, and an access methods 270.Specifically, the SQL statements are passed to the parser 261 whichconverts the statements into a query tree—a binary tree data structurewhich represents the components of the query in a format selected forthe convenience of the system. In this regard, the parser 261 employsconventional parsing methodology (e.g., recursive descent parsing).

The query tree is normalized by the normalizer 263. Normalizationincludes, for example, the elimination of redundant data. Additionally,the normalizer 263 performs error checking, such as confirming thattable names and column names which appear in the query are valid (e.g.,are available and belong together). Finally, the normalizer 263 can alsolook-up any referential integrity constraints which exist and add thoseto the query.

After normalization, the query tree is passed to the compiler 265, whichincludes an optimizer 266 and a code generator 267. The optimizer 266 isresponsible for optimizing the query tree. The optimizer 266 performs acost-based analysis for formulating a query execution plan. Theoptimizer will, for instance, select the join order of tables (e.g.,when working with more than one table), and will select relevant indexes(e.g., when indexes are available). The optimizer, therefore, performsan analysis of the query and selects the best execution plan, which inturn results in particular access methods being invoked during queryexecution. It is possible that a given query may be answered by tens ofthousands of access plans with widely varying cost characteristics.Therefore, the optimizer must efficiently select an access plan that isreasonably close to an optimal plan. The code generator 267 translatesthe query execution plan selected by the query optimizer 266 intoexecutable form for execution by the execution unit 269 using the accessmethods 270.

All data in a typical relational database system is stored in pages on asecondary storage device, usually a hard disk. Typically, these pagesmay range in size from 1 Kb to 32 Kb, with the most common page sizesbeing 2 Kb and 4 Kb. All input/output operations (I/O) against secondarystorage are done in page-sized units—that is, the entire page isread/written at once. Pages are also allocated for one purpose at atime: a database page may be used to store table data or used forvirtual memory, but it will not be used for both. The memory in whichpages that have been read from disk reside is called the cache or bufferpool.

I/O to and from the disk tends to be the most costly operation inexecuting a query. This is due to the latency associated with thephysical media, in comparison with the relatively low latency of mainmemory (e.g., RAM). Query performance can thus be increased by reducingthe number of I/O operations that must be completed. This can be done byusing data structures and algorithms that maximize the use of pages thatare known to reside in the cache. Alternatively, it can be done by beingmore selective about what pages are loaded into the cache in the firstplace. An additional consideration with respect to I/O is whether it issequential or random. Due to the construction of hard disks, sequentialI/O is much faster then random access I/O. Data structures andalgorithms encouraging the use of sequential I/O can realize greaterperformance.

For enhancing the storage, retrieval, and processing of data records,the server 230 maintains one or more database indexes 245 on thedatabase tables 250. Indexes 245 can be created on columns or groups ofcolumns in a table. Such an index allows the page containing rows thatmatch a certain condition imposed on the index columns to be quicklylocated on disk, rather than requiring the engine to scan all pages in atable to find rows that fulfill some property, thus facilitating quickaccess to the data records of interest. Indexes are especially usefulwhen satisfying equality and range predicates in queries (e.g., a columnis greater than or equal to a value) and “order by” clauses (e.g., showall results in alphabetical order by a given column).

A database index allows the records of a table to be organized in manydifferent ways, depending on a particular user's needs. An index keyvalue is a data quantity composed of one or more fields from a recordwhich are used to arrange (logically) the database file records by somedesired order (index expression). Here, the column or columns on whichan index is created form the key for that index. An index may beconstructed as a single disk file storing index key values together withunique record numbers. The record numbers are unique pointers to theactual storage location of each record in the database file.

Indexes are usually implemented as multi-level tree structures,typically maintained as a B-Tree data structure. Pointers to rows areusually stored in the leaf nodes of the tree, so an index scan mayentail reading several pages before reaching the row. In some cases, aleaf node may contain the data record itself. Depending on the databeing indexed and the nature of the data being stored, a given key mayor may not be intrinsically unique. A key that is not intrinsicallyunique can be made unique by appending a RID. This is done for allnon-unique indexes to simplify the code for index access. The traversalof an index in search of a particular row is called a probe of theindex. The traversal of an index in search of a group of rows fulfillingsome condition is called a scan of the index. Index scans frequentlylook for rows fulfilling equality or inequality conditions; for example,an index scan would be used to find all rows that begin with the letter“A”.

Although client/server database systems remain in wide use, the presentinvention, in its currently preferred embodiment, is implemented in aShared Disk Cluster database system environment, which provides severaladvantages compared to prior art client/server database systems. Amongthese advantages are that a Shared Disk Cluster system can be moreeasily expanded. Users may easily add additional servers (nodes) inorder to increase system capacity and provide improved performancewithout major data restructuring and the associated system downtime.This also enables users to purchase hardware in smaller increments asneeded to keep up with growth. Other advantages of a Shared Disk Clusterarchitecture include lower total cost of ownership (TCO), continuousavailability, high performance, and single system presentation.

FIG. 3 is a high-level block diagram of a Shared Disk Cluster databasesystem environment 300 illustrating an example of a four node (servers311, 312, 313, 314) cluster accessed by a plurality of clients (clients301, 302, 303). The term “cluster” refers to a collection of more thanone networked (and usually homogeneous) nodes, which function as asingle system. Each node generally contains its own CPU and memoryresources. The term “clustered server” refers to a database server(currently implemented using Sybase® Adaptive Server® Enterprise (“ASE”)available from assignee Sybase of Dublin, Calif.) which runs on acluster (cluster DB 330) and jointly manages a single installation ofthe databases on the shared disk storage 335. As shown, the environment300 also includes a quorum disk 339. The quorum disk 339 is a shareddisk device used for cluster membership arbitration. The quorum diskalso maintains a history of runtime cluster view changes.

A Shared Disk Cluster database system can be implemented using low cost“blade servers” such as Intel/Linux machines. In the presently preferredembodiment, nodes in the cluster communicate with each other throughprivate interconnects (e.g., private interconnect 325). As shown at FIG.3, the nodes are interconnected via redundant high-speed interconnectswith each node also having a direct connection to all databases on adisk subsystem. Gigabit Ethernet and Infiniband may be used to providethese high-speed interconnects. The storage subsystem may be implementedusing raw device support with a storage area network (SAN 329) or withfile system support (e.g., through use of a clustered file system suchas those from Veritas or Polyserv).

The above-described computer hardware and software are presented forpurposes of illustrating the basic underlying desktop and servercomputer components that may be employed for implementing the presentinvention. For purposes of discussion, the following description willpresent examples in which it will be assumed that there exist multipleserver instances (e.g., database server nodes) in a cluster thatcommunicate with one or more “clients” (e.g., personal computers ormobile devices). The present invention, however, is not limited to anyparticular environment or device configuration. Instead, the presentinvention may be implemented in any type of system architecture orprocessing environment capable of supporting the methodologies of thepresent invention presented in detail below.

Overview of Assignment of Unique Identifiers in Distributed Environment

An efficient system for assigning unique identifiers in a distributedsystem environment is described. The system and methodology of thepresent invention provides for distributing unique identifiers amongnodes of the distributed system in a manner which ensures that there isreduced messaging among the nodes in the network when identifierallocation is in progress. Also, the approach provides the option forreuse of values for node(s) leaving the network, and for redistributionof identifiers when there are no free lists available.

In a system constructed in accordance with the present invention,information about the free lists of unique identifiers is available atall the nodes. Copies of the same free list are effectively stored inall the nodes. Any node which wants to grab some identifiers will notneed to ask any other node for these identifiers since each node has itsown copy of the free list of identifiers stored locally and can allocatefrom that copy. Once a node allocates some identifiers for itself, itwill update other nodes about the allocation.

At a high level, the system and methodology for allocation of uniqueidentifiers proceeds as follows. First, the allocating node obtains theright to update the identifier lists through a distributed lockingmechanism. Second, having received the right to update the identifierlists, the allocating node allocates for itself a set of identifiersfrom a free list of identifiers and updates the free list of identifiersto reflect the new allocation. Having allocated for itself a new set ofidentifiers and updated the free list of identifiers, the allocatingnode sends the updated free list of identifiers to all the other nodes.Upon receiving the updated free list of identifiers, each of the othernodes updates its own respective copy of the free list of identifiers.Finally, the allocating node relinquishes the right to update (which wasreceived in the initial step).

If a node leaves the system, a mechanism is employed to use the valuesleft unused by the leaving node. This leads to the feature that even ifall the nodes leave the system, the system ensures that the unusedidentifiers are not lost and will be reused. The identifiers may be(optionally) redistributed among the nodes when the free list ofidentifiers gets exhausted and no free identifiers remain. All told, theapproach provides new methods for efficiently distributing theidentifiers among the nodes, which improves scalability and reducesmessaging between nodes. Before describing the operations of the presentinvention in more detail, some terms and concepts that are used in thefollowing discussion will be explained.

Introduction to Unique Identifiers

Use of Unique Identifiers: Identity Column

Identity values in an identify column of a table need to be unique forevery row in the table, so as to uniquely identify a particular rowwithin a given table. A table descriptor (i.e., DES) maintains the lastidentity value assigned. This value is incremented whenever a row isadded, thus maintaining the uniqueness property. (In a SMP serverenvironment, there is only one copy of the DES and hence synchronizationat the DES level suffices.)

In a distributed environment, such as a Shared Disk Cluster environmentas illustrated at FIG. 3, DES is instantiated at multiple cluster nodeswhich will be inserting rows in a table. In a cluster environment, theuniqueness of the identity value in an identity column needs to bemaintained amongst nodes the cluster that are sharing access to datamaintained by the database. Therefore, synchronization has to beperformed at the cluster level across the various instantiations of theDES to ensure the property of uniqueness. A naive approach to ensureuniqueness is to define an “OCM lock” for the last identity value.Whenever a node wants to insert a row it grabs the lock in exclusivemode. However, it is very inefficient to grab a lock for every insert toa table.

The present invention addresses the fundamental problem of how aclustered server node can obtain an identity value which is uniqueacross the distributed environment. The present invention provides asolution which has minimal performance overhead and at the same time issimple to implement. The solution also efficiently handles nodes leavingthe cluster, nodes joining the cluster, failure situations and the like.

Identity Gaps

By default, the database system in which the present invention iscurrently implemented (Sybase® Adaptive Server® Enterprise, availablefrom Sybase, Inc. of Dublin, Calif.) allocates a block of identityvalues (unique identifiers) in memory instead of writing each identifier(identity value) to disk as it is needed (which would require moreprocessing time). In doing so, the system writes the highest number ofeach block to the table's “sysobjects” (system objects) row. This numberis used as the starting point for the next block after the currentlyallocated block of numbers is used or “burned”. The other numbers of theblock are held in memory, but are not saved to disk. Numbers areconsidered burned when they are allocated to memory, then deleted frommemory either because they were assigned to a row, or because they wereerased from memory due to some abnormal occurrence such as a systemfailure. Examples of such system failure are “shutdown with nowait” andthe like.

Allocating a block of unique identifiers (identity values) improvesperformance by reducing contention for the sysobjects row of the table.However, some problems remain. For example, if the system fails or isshut down with no wait before all the identifiers are assigned, theunused identifiers are burned (i.e., lost). For example, assume that ablock of 1000 unique identifiers is obtained. When the system is runningagain, it starts numbering with the next block of numbers based on thehighest number of the previous block that was written to disk. Forexample, if the first 1000 unique identifiers has been burned, anotherblock (e.g., 1001 to 2000) will be obtained from the disk and broughtinto memory and so on and so forth. Depending on how many allocatednumbers were assigned to rows before the failure, one may have a largegap in the identity values.

In the presently preferred embodiment, several parameters are providedto control these identity block sizes:

1. Identity burning set factor: This is a server-wide setting andessentially represents a default identity gap setting if the identitygap for a particular table is not specified. The burning set factorindicates a percentage of the total available identifiers (identityvalues) one wants to have allocated for each block. This setting is thesame for all tables.

2. identity_gap: This is a table-specific setting. This settingoverrides the burning set factor for a specific table so as to createidentifier (identity value) blocks of a specific size for the table. Theidentity_gap setting overrides the identity burning set factor for thetable.

3. identity grab size: Also a server-side setting which reserves a blockof contiguous identifier numbers for each process/task. For every objecta process/task will acquire blocks of size=“identity grab size”. Thissetting works with the identity_gap and identity burning set settings.In operation, an identity burning set (or identity_gap) number ofidentity values (IDs) are allocated to the server and information aboutthis allocation is kept in DES. When a given process/task requiresidentity values, an “identity grab size” number of the allocated IDs(identity values) is assigned to the process/task as described below inmore detail. For example, a block of 20 identifiers may be assigned to atask requesting identity values if the identity grab size is set to 20.Assigning these values in blocks (rather than one by one) reducescontention amongst tasks at a given node.

As described above, the parameter identity_grab_size is associated witha given task/process which can reside on any node of the cluster. Theidentity gap and identity burning set parameters have the samecharacteristics except that one is a server-wide setting and the otheris table-specific setting. In the discussions that follow, references toidentity gaps may refer to either the identity_gap setting or theidentity burning set factor (unless otherwise indicated by the context).It should also be noted that the identity_gap setting has an impact onthe number of identity values that may be lost in the event of a clusteror node crash. Generally, in the case of cluster/node crash, one maylose a maximum of the identity_gap number of identity values. Theidentity gap also has performance implications. If the identifiers aretaken from disk in smaller blocks, one may have to access the disk morefrequently to obtain needed identity values.

It should be observed that because of caching of identity values and theidentity grab size parameter, identity columns in a table need not bemonotonically increasing. In the currently preferred embodiment, theyare monotonically increasing for a particular process, such as, forexample, isql (interactive SQL) session. However, multiple processes caninsert rows in any order as only uniqueness for the identity columns isneeded.

Distributed System Environment

FIG. 4 is a high level block diagram illustrating a distributed systemenvironment 400 in which the system and methodology of the presentinvention may be implemented. As shown, a cluster of four networkedserver nodes (411, 412, 413, 414) share access to data on shared diskstorage 435. As also shown at 425, the nodes are interconnected and eachnode has a direct connection to all databases on the shared disk storage435. An identity manager (identity mgr) module at each node of thecluster as shown at 451, 452, 453, 454 at FIG. 4 implements methodologyof the present invention. The identity manager manages and maintainsinformation about identifiers allocated at each node of the cluster ashereinafter described in detail. The identity manager module at a givenserver node interacts with its counterparts at other nodes and writesinformation to, and retrieves information from the shared disk storage435 and identity data structures at each of the nodes as shown at 461,462, 463, 464. As shown at FIG. 4, the information maintained at eachnode includes an identity data structure 461, 462, 463, 464, a clustercurrent value as shown at 471, 472, 473, 474, and a cluster maximumvalue as shown at 481, 482, 483, 484. This information is protected by adistributed locking mechanism (OCM lock), which is not separately shownat FIG. 4

In operation, when identifiers (identity values) are needed at nodes ofthe cluster, one of the identity managers (e.g., identity manager 451 atnode 411) would obtain block of identifiers from the sysobjects row of adatabase table stored on shared disk storage 435. This block could beobtained on initial start-up of the system, but a more typical case inwhich identifiers would be requested is when a prior block ofidentifiers has been used or “burned”. Assume for example, a block of1000 identifiers is obtained from sysobjects row by the identity manager451. As part of this operation, identity manager 451 informs the otheridentity manager modules (e.g., 452, 453, 454) and the allocation ofthis block of identifiers is recorded at all of the nodes.

The identity manager then divides up the block of identifiers so thatidentifiers could be provided for use at a given node in response todemand. In the presently preferred embodiment, an “equal sharing clusterapproach” method is utilized, as described in more detail below. Forinstance, using this same example of a block of 1000 identifiers and afour node cluster, the block of 1000 identifiers is divided into fourblocks of 250 each. Although an equal sharing approach is utilized inthe currently preferred embodiment, those skilled in the art willappreciate that various other methods may be used to allocate the blocksof identifiers amongst nodes of the cluster. Although the identifiersare divided into four blocks of 250, the approach utilized in thecurrently preferred embodiment is to provide these blocks to a givennode on demand. In other words, identity values are assigned to a givennode, as needed and are not pre-allocated to each of the nodes in equalquantities. This allows for the possibility that some of the nodes willlikely use more (or less) blocks than others (e.g., one node could, forexample, use the entire block of 1000 identifiers if it was the mostactive).

Identifiers allocated for use at a given node are then made available totasks at the node needing them for performing operations. This may beillustrated by example. Assume, for instance, that a given task (notseparately shown at FIG. 4) at node 411 requires identity values inorder to perform operations. The task would interact with the identitymanager module 451 at node 411 in order to obtain the needed identifiers(identity values). If necessary (i.e., if there were not sufficientidentifiers already allocated for use at that particular node), identitymanager 451 would allocate a block of identifiers (e.g., 250identifiers) for use at node 411 and would update the identity managersat the other nodes (i.e., identity managers 452, 453, 454) as to thisallocation. Information about the allocation is also stored at each ofthe nodes (e.g., by updating the identity data structure 461 and clustercurrent 471 at node 411 and in corresponding structures at other nodes).If identifiers are available at the node and if the above-describedidentity_grab_size parameter is set, the task will receive theidentity_grab_size number of identifiers. For example, if theidentity_grab_size parameter is set to 20, the task would receive ablock of 20 identifiers from the identity manager 451. This allocationwould be tracked locally by the identity manager 451 at node 411;however it would not be communicated to the other nodes (e.g., nodes412, 413, 414 in this example). Before describing these operations ofthe present invention in allocating unique identifiers in greaterdetail, some design considerations and approaches will be described.

Design Considerations

There are various considerations that were observed in designing anapproach to for assigning unique identifiers in a Shared Disk Clusterdistributed system environment. These considerations include thefollowing:

1. There should not be too much contention amongst nodes of the clusterto obtain identity values.

2. There should not be too much contention at a clustered server node toobtain identity values.

3. In case of DES getting scavenged in a cluster node, the situationshould be handled gracefully. Care should be taken to not lose identityvalues allocated to the node. (Similar situation when a cluster node isshutdown.)

4. A new node joining or leaving the cluster should be impactedminimally.

5. In case of cluster shutdown (or all nodes in a cluster failing), thenumber of identity values lost should be as few as possible.

6. Failover handling should be addressed.

7. When a given node needs identity values and all identity values havebeen allocated previously in the cluster, then the node needs to grabunused identity values from other nodes. The period in which the nodegrabs nodes from other nodes is known as “grabbing period” and if a nodejoins/leaves the cluster during this period or there is a DES scavenge,the solution should be able to handle it.

8. Cluster crash recovery should be possible.

Design Strategy

Two alternative approaches or strategies are available to assign uniqueidentity values in a distributed cluster system environment which arereferred to herein as: a) Cluster caching identity scheme, and b)Non-cluster caching identity scheme. Both strategies are based on theassumption that the identity values need to be unique, but do notnecessarily need to be monotonically increasing. The present inventionuses the cluster caching identity scheme as it provides severaladvantages, including that it reduces contention on the sysobjects rowof database table(s). In typical operation, the sysobjects row is a hotspot for a table and is frequently accessed for a number of reasonswhich are unrelated to the identity column. The fact that the clustercaching identity scheme has less sysobjects row contention is a majorbenefit of the cluster caching scheme.

The performance overhead of can be evaluated on the basis of twoparameters:

(1) Number of accesses to sysobjects row When a new set of identityvalues is needed, a new set is burned (with an update of the entry insysobjects). The field objidentburnmax in des→dobjectc is protected by adistributed lock (referred to as an OCM lock). Any update to this fieldincludes the following steps: a) Take an exclusive (EX) lock on the OCMlock; b) Push the updated value to all the other nodes; c) Log theupdate; and d) Release the exclusive (EX) lock.

(2) Number of exclusive (EX) locks needed. Every scheme might need itsown set of OCM locks for implementing it; the locking overheadassociated with those locks will impact the performance of the system.

The cluster caching identity scheme of the currently preferredembodiment will next be described. It should be noted that in thefollowing discussion, some examples are used in which unique identifiersare issued in monotonically increasing sequences. However, it should beunderstood that this is not required by the present invention.

In the cluster caching identity scheme, the following high-levelapproach is adopted: allocate IDs (identity values) from the sysobjectsrow in blocks (chunks) of the size identity_gap and use them in thecluster until the values are exhausted. The first node doing an insertwill obtain an allocation of identity_gap number of identity values(IDs) from the sysobjects row. For example, a block of uniqueidentifiers from 1 to 1000 may be allocated for use in the cluster. Thisis stored in memory at each of the nodes of the cluster as describedbelow in more detail. These allocated identity values are reserved foruse in the cluster, but at this point are not yet assigned to particularnode(s) of the cluster.

There are various considerations that one should observe in selectingthe identity_gap setting. One consideration is the identity_gap numberof unique identifiers which are obtained from the disk and cached inmemory affects the maximum number of identifiers that are at risk in theevent of a system crash. For example, if the identifiers are obtainedfrom disk in smaller blocks (e.g., 100) only a small number would beburned in the event of a crash rather than the larger number at risk ifthe identifiers are obtained in larger blocks (e.g., 1000). Theidentity_gap setting also has performance implications. Suppose, forexample, a given task has to allocate 950 identity values. If theidentifiers are taken from disk in blocks of 100, the system may have toaccess the disk ten times in order to obtain the required identityvalues. However, with a larger identity_gap setting of 1000, the diskneed only be accessed once (thereby reducing number of times diskaccessed and improving performance).

The identity_gap number of values allocated for use in the cluster aremade available for use at particular node(s) in response to demand fromsuch node(s). Those skilled in the art will appreciate that a number oftechniques may be utilized for allocating the identity values amongstthe nodes. In the presently preferred embodiment, the general approachis to divide the block of values allocated for use in the cluster intosmaller blocks or units using a methodology which is referred to as an“equal sharing” approach. However, these smaller blocks are onlyallocated for use at particular node(s) of the cluster in response todemand from such node(s). When there are no available identity valuesleft in the cluster, another identity_gap number of identifiers isallocated from the sysobjects row and so on. For example, after thefirst 1000 unique identifiers has been burned, another block (e.g., IDs1001 to 2000) will be obtained and brought into memory for use in thecluster and so on and so forth. These operations of the presentinvention are described below in more detail.

Equal-Sharing Cluster Approach

The present invention, in its currently preferred embodiment, includesan “equal sharing” allocation methodology to distribute the identity_gapnumber of values allocated for use in the cluster. As noted above, thesystem will grab identity_gap number of values from the sysobjects rowon behalf of the cluster. Assume, for example, the system may grab 1000identity values (i.e., based on the identity gap being set to 1000). Aparameter called equal_share_value is defined which is equal to(identity_gap/no._of_nodes). If equal_share_value becomes less than 1,equal_share_value is set to 1. The system essentially divides theidentity_gap number of values into smaller blocks of a size ofequal_share_value. For example, if the identity gap is set to 1000 andthere are four nodes in the cluster, equal_share_value will be 250. Notethat these smaller blocks of identifiers are not immediately allocatedto the nodes of the cluster. Rather, they are allocated in response todemand from the nodes. In other words, the overall identity_gap numberof cached values is divided into chunks of size equal_share_value andallocated to the nodes on a demand basis.

Assume that I is the identity_gap and N is the number of nodes. Then,every node gets (I/N) number of identity values from the cached valuesavailable with the cluster when it needs identity values. Using the sameexample above of an identify gap of 1000 and a four node cluster, eachnode would obtain 250 identity values when it needed additional identityvalues.

For implementing the present invention, several parameters are alsomaintained to track the free identifiers that are available for use inthe cluster. For every node in the cluster, a node current value(curr_val) and a node maximum value (node_max) are maintained. Inaddition, a cluster current value (cluster_current) and a clustermaximum value (cluster_max) are also maintained as cluster-wide values.Cluster_current stores the information about the start of the next chunk(block) that can be allocated to a node and cluster_max stores themaximum identity value assigned to the cluster. Thus, at every point oftime (cluster_max−cluster_current) identity values are available in thecluster and not yet allocated to any of the nodes. For example, ifcluster_max is 2000 and cluster_current is 1500, that indicates thatthere are 500 identity values available for use by nodes in the clusterthat have not yet been allocated to any node. The node current value(curr_val) and node maximum (node_max) values are maintained and usedlocally at each node. For example, assume identity values 251 to 500 areassigned (allocated) to a given node. The curr_val (current value) isused to keep track of the identity value which has last been used at thenode. For example, if the current value is 400 and the max value is 500,then the node still has 100 identity values available for use(node_max−curr_val). When curr_val is equal to node_max, no moreidentifiers are available at the node. In such event the node wouldrequest an additional block of identifiers as described herein.

The curr_val and node_max value for every node along withcluster_current and cluster_max are part of a common data structure(identity data structure) which is protected by an OCM lock. The OCMlock is a distributed lock that synchronizes access to this datastructure. A node updating the data structure will acquire the OCM lock,make its updates to the values (e.g., to obtain additional identityvalues for use by the updating node), push the updated values to theother nodes of the cluster, and then release the lock. Also, for everynode state information is maintained which denotes the state of thevalues, i.e., whether they are allocated or free. This common structure(described below in more detail) is part of DES and hence it isinstantiated on all nodes in the cluster where the DES is instantiated.Curr_val for every node is needed to be a part of this common structureto handle the DES Scavenge issue as explained later. After every updateof this data structure, the updated values are pushed by the updater tothe other nodes in the cluster.

A node can continue its inserts on the table as long as it has freeidentity values for itself, i.e., its des→curr_val is less thannode_max. It does not need to have any lock on the common identitystructure for doing the inserts. This reduces the number exclusive locksneeded.

The following pseudocode illustrates the operations of the identitymanager at a given node n when it finishes its set of values (i.e.,des→curr_val becomes equal to node_max):

 1: Take an exclusive (EX) lock on the identity structure.  2: Ifcluster_current < cluster_max, then  3: {  4:  If (cluster_current +equal_share_value) < = cluster_max, then  5:  {  6:   a) node_max (n) =cluster_current + equal_share_value  7:   Node_current(n)=cluster_current;  8:   b) cluster_current + =equal_share_value  9:  } 10:  else 11:  { 12:   a) node_max (n)=cluster_max; 13:    Node_current(n)=cluster_current; 14:   b)cluster_current=cluster_max; 15:  } 16:  else 17:  { 18:   Grab a newset of identity_gap values from the sysobjects row      and 19:  allocate it for the cluster. Also allocate equal_share_value values20:   to node n, and update cluster_current and cluster_max. 21:   Setcurrent_node val state as ALLOCATED. 22:  } 23: } 24: Push the updatedvalues to the other nodes in the cluster using 25:   callback mechanism26: Release the exclusive (EX) lock.

If cluster_current is less than cluster_max at line 2 above, thenidentifiers are available in the cluster and additional values do notneed to be obtained from the sysobjects row. Otherwise (i.e., ifcluster_current is equal to cluster_max) additional values are needed.In this case, the else condition at line 16 applies and a new set ofidentifiers is grabbed from the sysobjects row and allocated for use inthe cluster. As part of this operation, equal_share_value identifiersare allocated to node n, and cluster_current and cluster_max areupdated. When the necessary steps are completed, the updated values arepushed (replicated) to other nodes in the cluster as shown at lines24-25 and the exclusive lock is released as shown at line 26.

Identity Grab Size

The identity manager at a given node also is responsible for providingidentity values allocated at the node to tasks needing them. If theidentity_grab_size parameter is set, then for every task needingidentifiers the identity manager reserves identity_grab_size number ofidentifiers (IDs) as illustrated by the following:

 1: If (node_max − node_curr) >= (grab_size) then  2: {  3: pss.idtval.start= node_curr;  4:  node_curr + = grab_size;  5: }  6:Else  7: {  8:  Take exclusive (EX) lock  9:  if (cluster_max −cluster_current) >= (grab_size) then 10:  { 11:   pss.idtval.start=cluster_current; 12:   cluster_current + = grab_size; 13:  } 14:  else15:  { 16:   grab from sysobjects, allocate a new set of values for thecluster and further allocate grab size number of values for the currenttask. 17:   [ Amount to grab is (identity_gap) + ((grab_size) −(cluster_max − cluster_current)) ] 18:   pss.idtval.start=cluster_current; 19:   cluster_current + = grab_size; 20:  (node valstate is unchanged) 21:  } 22: } 23: Push the updated values to theother nodes in the cluster 24: Release exclusive (EX) lock

As illustrated above at line 1, if the number of identifiers available(node_max−node_curr) is greater than or equal to the grab_size, thensufficient identifiers are available at the node. (Note that node_curris equivalent to curr_val discussed above.) In this case, grab_sizenumber of identifiers are provided to the task and node_curr isincremented by grab_size to indicate the number of identifiers given tothe task for its use. However if there are not a sufficient number ofidentifiers remaining available locally (i.e., previously allocated foruse at that node), then the else condition at line 6 applies and stepsare taken to obtain additional identifiers from the cluster level. Ifcluster_max−cluster_current is greater than or equal to the grab_size asshown at line 9 of the above routine, then values are available at thecluster level and are allocated to the node as provided at lines 11-12.Otherwise, if sufficient values are not available, the else condition atline 14 applies and the additional identifiers are obtained from thesysobjects row as shown above at lines 16-20. These operations aredescribed below in more detail.

FIG. 5A comprises a flowchart illustrating methodology 500 of thepresent invention for allocation of identifiers to tasks at a givennode. When a task needs identifiers for performing operations, the taskinteracts with the local identity manager at the node to obtain identityvalues. For example, if a given task T1 requires identifiers, itrequests the identifiers from the local identity manager as provided atstep 501. In response, the identity manager initially evaluates whetheridentity values are available locally at decision step 502. If thenumber of identifiers available (i.e., allocated for use at the node) isgreater than or equal to the grab_size, then the method proceeds to step504; otherwise, it proceeds to step 503. As previously discussed, thenumber of identifiers available at a given node is equal to the nodemaximum value minus the node current value (i.e., node_max−node_curr).If, for example, a given task T1 at the node needs 10 values(grab_size=10) and node_max−node_curr is 100, then identifiers areavailable at the node and the method proceeds to step 504. However, inother instances there may not be sufficient identifiers availablelocally at the node to satisfy the request. For example, a task T2running at the same node may need 100 identifiers (i.e., grab_size=100)for performing particular operation(s). Assume, however, there are only50 identity values available (i.e., node_max−node_curr=50) at the nodeat the time of the request. In this case the identity manager proceedsto obtain the needed identifiers as shown at step 503. At step 504,grab_size identifiers are given to the task by the identity manager andnode_current is also increased by grab_size.

FIG. 5B comprises a flowchart 503 (corresponding to step 503 of FIG. 5A)illustrating methodology of the present invention for allocation ofidentifiers to a given node in response to demand for identifiers atsuch node. When additional identifiers are needed at a given node, anexclusive lock (OCM lock) is acquired on the identity data structure atstep 511. Next, at decision step 512 a check is made to determine ifidentifiers are available at the cluster level (i.e., have beenpreviously obtained from the sysobjects row and have not yet beenallocated to the node(s)). If sufficient identifiers are not availableat the cluster level, the method proceeds to step 513 to obtainadditional identifiers from disk. Otherwise, if identifiers areavailable at the cluster level, the method proceeds to step 514 toallocate available identifiers to the node. Entries in the identity datastructure are also updated to reflect this allocation of identifiers tothe node. After the above steps have been completed, the updated valuesin the identity structure are pushed to other nodes in the cluster asprovided at step 515. At step 516 the OCM lock on the identity structureis released.

FIG. 5C comprises a flowchart 513 (corresponding to step 513 of FIG. 5B)illustrating methodology of the present invention for an identitymanager at a given node to obtain an additional block of identifiersfrom disk for use in the cluster. When additional values are needed inthe cluster, at step 521 a new block of identifiers is grabbed from thesysobjects row and allocated for use in the cluster. At step 522,identifiers are allocated to the node needing them. In the currentlypreferred embodiment, equal_share_values are generally allocated to thenode needing additional identifiers (subject to special handling in somecases as described below). At step 523, cluster_current and cluster_maxand other identity data structure entries are updated. After thenecessary steps are completed to obtain identifiers for use in thecluster and provide a chunk (block) of identifiers to the requestingnode, updated values in the identity structure are pushed (replicated)to other nodes in the cluster and the OCM lock on the identity structureis released as described above and illustrated at steps 514, 515, 516 atFIG. 5B.

Context Migration

From time to time a task running on one node may migrate to another nodeof the cluster. This may be illustrated by an example referring again toFIG. 4. Assume, for instance, a task T3 (not separately shown at FIG. 4)on node 413 may migrate from node 413 to node 414 of the cluster. Alsoassume that 20 identity values had previously been allocated to Task T3(i.e., 20 values cached at the task level) on node 413. The approach ofthe present invention is to allow the task (e.g., Task T3 in thisexample) to retain its cached identity values and reuse them inperforming operations after the migration to another node in the cluster(e.g., at node 414 in this example). Context migration support is alsoappropriate as the identifiers grabbed as part of identity_grab_size aretask specific and part of PSS. When a task is being migrated to anothernode, it is also better to try to migrate the unused identity values tothe other node so as to avoid gaps. The identity manager provides threefunctions to support context migration:

a) to check if the task is migratable or not. This returns TRUE alwayssince reusing identity values at other nodes is possible (since alreadyreserved for that task).

b) to package the data on source node. The migration module calls thisfunction on the source node before migration. This function can packageall required control info and data, which is delivered to it on thedestination node. The range of unused values may be packaged inpss→idtval in this function.

c) to unpackage the data on destination node. After a task issuccessfully migrated, this function is invoked to unpackage the datareceived from the source node. The unused values in pss→idt_val is set.

DES Scavenge

When a given node is being shut down it would be preferable to be ableto reuse any identity values allocated to the node that have not yetbeen used. Assume, for example that a given node was allocated 250identity values, but had only used 30 of them. It would be preferablenot to lose (burn) the 220 identity values that remain. To do so, themethodology of the present invention provides for information regardingthe unused identifiers to be passed to the other nodes, so that anyother node in the cluster will have the ability to use the free values.When DES is being scavenged at a cluster node, one has a curr_val andnode_max for that node. Since the DES is getting scavenged, one knowsthat (node_max−curr_val) identifiers will be wasted. So the strategyadopted in the currently preferred embodiment is to take an exclusive(EX) lock and flag the entry corresponding to that node in the array asFREE (i.e., for that node from curr_val to node_max the values are FREEto be used) and not immediately redistribute the values. If oneimmediately tries to redistribute the values, one will have multiplechunks of identifiers at other nodes and that will unnecessarilyincrease the complexity during grabbing time. If one marks the entry asFREE, then there is a slight change in the identity value allocationstrategy (which change is described below). As described above, if anode runs out of identifiers, the node will obtain additionalidentifiers from the cluster (i.e., from the range of values betweencluster_current and cluster_max available at the cluster as previouslydescribed). However, another step is introduced to first find if thereis a FREE chunk available in the cluster. If so, then the entries markedas FREE are reused.

Last Node DES Scavenge

The above-described scavenging approach needs to be revised in the eventthe node that is leaving is the last node in the cluster. In this case,all allocated, but unused identity values will be lost, unless someother steps are taken. Thus, in this case, the FREE values are writtenback to disk for future re-use. When the DES is scavenged at the lastnode in the cluster, the FREE chunks from all nodes are written to disk,in order to avoid burning all the unused identity values that wereallocated for use in the cluster. The last node leaving the cluster willhave the curr_val and node_max of all the nodes and also thecluster_current value. It will encode this information and write todisk, marking all of them as FREE. The on-disk representation iscurrently implemented as an entry in the sysattributes system table foreach node in the cluster which had FREE values. Also, one stores thevalue of cluster_current. When the DES is reinstantiated for the firsttime in the cluster, it will read from the disk and initialize the arrayvalues depending on the values stored in the disk.

Methodology for Reuse of Identifiers of Leaving Node

FIG. 6 is a flowchart illustrating the methodology 600 of the presentinvention for scavenging identifiers when a node leaves the cluster. Themethodology enables reuse of identifiers cached by a node when the nodeleaves the cluster. When a node is leaving the cluster, the identitymanager at the node takes an exclusive lock on the identity datastructure at step 601. Next, the current node's array entries are markedas FREE at step 602. Currently, this is done by setting the node valstate as FREE. At step 603, a check is made to determine whether thenode is the last node leaving the cluster having identifiers (andidentity data structure values) that need to be scavenged. If the nodeis the last node leaving the cluster the method proceeds to step 604.Otherwise, it proceeds to step 605. If the node is the last node in thecluster having the identity data structure values to be scavenged, theFREE entries and the cluster_current values are written to thesysattributes system table at step 604. Otherwise, if it is not the lastnode, the values are pushed to other nodes in the cluster using acallback mechanism at step 605. After these steps are completed, theexclusive (EX) lock on the identity data structure is released at step606.

Identifier Allocation

For implementing reuse of identifiers marked as FREE, the previouslydescribed identity value allocation strategy is modified so that when anode needs identifiers, the identity manager at the node will firstsearch for a FREE chunk. If there is no FREE chunk, only then will theidentity manager try to allocate from the chunks between cluster_currentand cluster_max as explained earlier. A node needing identity valuesfollows the general steps described below:

1. If a node finds its own FREE chunk (i.e., the node val state in theID array is FREE), it uses the chunk and marks the chunk NOT FREE;

2. Otherwise, it tries to get a FREE chunk. If it gets a FREE chunk, ituses that and marks it NOT FREE;

3. Otherwise, it tries to get a chunk from cached values in cluster(using the strategy explained earlier).

A side-effect of using FREE identity values of other nodes is that, theidentity values in a particular node can go backward.

Node Joining/Leaving

If a node joins the cluster, there is no issue until it needs to performan insert into the table. During the first insert, the identity managerat the node has to grab an exclusive (EX) lock to initialize itsidentifier (ID) block. Then, it can get the exclusive (EX) lock and getits ID block allocated. When a node leaves the cluster, the case issimilar to DES scavenging as described above. Any update to the identitydata structure is done under exclusive (EX) lock (thereby eliminatingissues).

Cluster Shutdown

Cluster shutdown should be handled very gracefully. It is the same asDES scavenge at all nodes and one should not lose any identity values.The situation is handled in the same manner as DES scavenge (asdescribed above) at all nodes with FREE entries getting written out tothe sysattributes system table (i.e. to disk), one row per node havingFREE entries.

Cluster Crash

In case of a cluster crash, since identity_gap block of identifiers aretaken from sysobjects row in every node until one runs out of values, itis possible to lose a maximum (2*identity_gap−equal_share_value) numberof identity values. When the cluster reboots, it starts allocatingidentity values from the last burned value in the sysobjects row.

Failover Handling and Recovery

Failover handling is handled in the following manner. If a node crashedand it was holding the exclusive (EX) lock at the time of crash, thenthe lock may be marked as INDOUBT. (If it was not holding exclusive (EX)lock, it is not marked as INDOUBT.) If lock is not INDOUBT, then oneneed not do anything; whatever identity values were assigned to thatnode will be lost.

If the lock is INDOUBT, then values of cluster_current and cluster_maxmay not be correct and the node doing failover will have to make thevalues consistent. If the DES was instantiated at any other node, thenone may use the values from that other node. Remember that after everyupdate of the identity data structure, the updated values are pushed tothe other nodes. This mechanism ensures that one always has the lastconsistent values of the identity data structure (and one can always usethose values). When the DES gets reinstantiated at some other nodelater, the identity values will be granted based on whatever value is inthe sysobjects row. The system loses (cluster_max−cluster_current)number of values.

A lock can also be INDOUBT if DES Scavenge was underway when the nodecrashed. In this case, one checks if the DES is available at some othernode. If the crashed node's entry is FREE, it means other node got theupdated value from the node before it crashed and one can reuse thevalues. If the entry is not FREE, it is discarded. If the DES is notavailable at other node, there is no information, thus necessitatingthat the entry be discarded for the failed node. As the approach of thepresent invention involves replication, in which current values arepushed to other nodes, the last consistent values of the identitystructure on all the nodes should always be available. Thus, theprevious values can be used in a failover situation. This is asignificant improvement from prior art approaches which rely on acentralized mechanism. With the present invention, when one node fails,the other nodes already have the necessary information and cangracefully handle the failover. In prior art centralized systems whichrely on a “master” node, all the necessary information must betransferred from the master node to its successor. If the master nodecrashes, there is also a greater potential for loss and more extensiverecovery steps are generally required. Thus, the fully distributedsystem and methodology of the present invention provides for increasedreliability.

Cluster Boot

When the cluster boots, it must glean certain context information. Here,if the cluster shut down properly, values are stored in thesysattributes system table. If this is the case, then the approach is toinitialize based on those values. An issue is how does one know whetherthe cluster was previously shutdown (meaning it needs to read identityvalues entry from sysattributes) or whether it had crashed (i.e.,sysattributes does not contain any FREE values). The solution isimplemented as follows. Whenever an entry is read in from thesysattributes table and used to instantiate the identifier datastructure, it is immediately deleted from sysattributes so that theentry is not used anymore. Thus, when a cluster boots and tries to readentries from sysattributes, it uses existing entries (if any) toinstantiate the node values, cluster_current, and cluster_max.

When a cluster boots, it checks for the following (basically thesituation when DES is instantiated for first time in cluster):

1. If there is no entry in sysattributes, instantiate the ID array withzeroes and start identifier allocation from sysobjects row value. Elsego to step 2

2. For every entry found in sysattributes

-   -   a) Use it to instantiate the identity data structure and mark it        as FREE for use by anyone in the cluster    -   b) Delete the entry from the sysattributes table

FIGS. 7A-B comprise a single flowchart illustrating the methodology 700of the present invention for instantiation/recovery of the identity datastructure when a cluster boots. At decision step 701, a check is made todetermine if the DES data version is >0. If the DES data version is >0,this means that it has obtained the last consistent version of theidentity data structure data values and the method proceeds to step 702.At step 702, the data values are used for initialization/recovery of theidentity data structures at each participating node. The system can usethe last consistent state to recover because the system always pushesupdated data to each of the participating nodes after every update ofthe identity data structure. At step 703, for every cluster node whichis down, the identity data structure array value is set to 0.

Otherwise, if the DES data version was not found to be greater than zeroat step 701, the DES does not have last consistent version and themethod proceeds to step 704. At step 704 the sysobjects system table isread to recover cluster_max. At step 705 the sysattributes system tableis read and a check is made to determine whether entries in thesysattributes system table exist at decision step 706. If entries exist,the entries are used to instantiate the identity data structure datavalues and cluster_current at step 707. At step 708, the entries used toinstantiate the identity data structure are deleted from thesysattributes system table. Otherwise if no entries are found to existat step 706, the identity data structure values are set to 0, andcluster_current is set to equal cluster_max at step 709.

Special-Case: Last Set of Identity Values

The following describes how to handle the last set of identity valuesusing this approach. Since an equal sharing approach cannot be used whenone hits the last set of identity values, an alternate mechanism isemployed to solve this special case (where one grabs identifiers fromother nodes in the cluster). Two basic approaches may be employed:

1. Periodically check whether a node is under-utilizing its identityvalues.

2. Use a grabbing method to grab identity values from other nodes.

Periodic Check Strategy

In this method, a particular node is periodically checked to determinewhether it is under-utilizing its identity values, and when detected(i.e., “yes”), then that node is asked to donate its identity values(i.e., mark the entry for that node as FREE). How does one decidewhether a node is under-utilizing its values? Heuristics may beemployed, such as evaluating whether the current value (curr_val) of anode is much less than the current cluster maximum value (cluster_max).If so, then the node is a candidate for donation. For deciding how oftento check, one can employ a strategy like check for every kth number ofsysobjects row accesses.

Reduced Equal Share Strategy

In this method, one reduces the value of equal_share_value to a reducedvalue as soon as the last chunk of values is hit. As soon as one comesto last set (when number of identity values remaining in sysobjects rowis <=identity_gap), the value of equal_share_value is set to apre-defined watermark (say last_share_value=10). After this, every nodetakes last_share_value number of identifiers from the cluster cachewhenever it needs identifiers. In this scheme, one does not need to doany grabbing from other nodes. However, this increases the number of EXLOCKS for last set of values and the total amount of identifiers that beunderutilized in the cluster, which in the worst case will be equal to(last_share_value*no._of_nodes). Whenever the last set is hit, thefollowing logic is performed:

If last_share_value<equal_share_value, then

Equal_share_value=last_share_value.

The preferred approach is a combination of the above, employing thefollowing strategy. When one hits the last point where there are novalues left in cluster and sysobjects row, a message is sent to all thenodes asking them to freeze inserts for the time being and give back alltheir identity values, similar to DES Scavenging. Thus, all the nodeswill mark their entry as FREE. At this point, one has multiple FREEchunks of identity values from which to allocate. This may be done usinga callback mechanism. Now onwards, one may start allocating from theseFREE blocks of identity values in chunks of last_share_value to eachnode. In the worst case, one will lose in the cluster a maximum of(last_share_value*no._of_nodes) number of values. Typically thelast_share_value is set to 1 to reduce number of values to a bareminimum. So, every node will take one value at a time from the FREEchunks and as a result, the maximum loss will be upper bounded byno._of_nodes.

The following steps are followed in this approach:

1. Take exclusive (EX) lock.

2. No values for current node to mark FREE, but still put all 0's incurrent node entry. Set equal_share_value=last_share_value.

3. Send message to all other nodes to surrender their values, thecallback function in the notified nodes will stop all local access tothe identity array and set identity array state as FREE. All nodes setequal_share_value=last_share_value.

4. After step 3, all the nodes have stopped their operations andsurrendered their values to be used in the cluster. Now, every node cantake last_share_value number of identifiers from the FREE chunks.

5. Current node gets last_share_value number of identifiers from theFREE chunks.

6. Push the updated values to all the other nodes in the cluster.

7. Release the exclusive (EX) lock.

Performance Study

As previously discussed, the performance of any scheme can be evaluatedon the basis of two parameters (i) the number of sysobjects row accessesto the disk and (ii) the number of exclusive (EX) locks that needs to betaken (ignoring for the time being the amount of exclusive (EX) locksand performance hit that will happen when one reaches the last chunk ofidentity values). Assuming that the total number of identity values=T, Iis the identity_gap, and N is the number of nodes, one can observe that:

(i) the number of sysobjects row accesses=(T/I), since every time onetakes I number of values from sysobjects row for distributing in thecluster.

(ii) the number of exclusive (EX) locks=(N)*(T/I), since for every setof identity_gap values, number of exclusive (EX) locks will be N. Thenumber of such sets is (T/I) and so total amount of locks=N*(T/I).

It should be noted that exclusive (EX) locks are also taken during DESScavenge operation, but these may be ignored as not quantifiable.

Detailed Internal Operation

The following description presents method steps that may be implementedusing processor-executable instructions, for directing operation of adevice under processor control. The processor-executable instructionsmay be stored on a computer-readable medium, such as CD, DVD, flashmemory, or the like. The processor-executable instructions may also bestored as a set of downloadable processor-executable instructions, forexample, for downloading and installation from an Internet location(e.g., Web server).

Identity Data Structure

For implementation of the present invention, a new identity datastructure (and related macros) is introduced in DES which isinstantiated on all nodes in the cluster. The identity data structure isas follows:

 1: typedef struct identity_info_node {  2:   BYTE    idt_node_curr[NUMERIC_MAXSIZE];    /* current value of node */  3:  BYTE     idt_node_max[NUMERIC_MAXSIZE];     /* max value of node */ 4:   BYTE    idt_node_val_state;   /* state of values in the node */ 5: } IDENTITY_INFO_NODE;  6:  IDENTITY_INFO_NODEidt_clusternodes_info[CLUSTER_IDENTITY_NODES];  7:  BYTE  idt_cluster_curr[NUMERIC_MAXSIZE]; /** cluster current value: */  8:/* States for idt_node_val_state */  9: #define IDENTITY_NODE_VAL_FREE0x01 /* FREE state means the id values were 10:   allocated to this nodebut they are now free 11: to be used in the cluster because a DES 12: scavenge or a cluster shutdown happened. 13: */ 14: 15: #defineIDENTITY_NODE_VAL_ALLOCATED  0x02 16: /* ALLOCATED state means the idvalues were allocated to this node 17: ** and are being used by thenode. So, they cant be used 18: ** by others in the cluster. 19: */ 20:#define IDENTITY_NODE_VAL_UNALLOCATED  0x04 21: /* UNALLOCATED statemeans the id values have not been allocated to this node */ 22: #defineIDENTITY_NODE_DES_INST  0x08 23: /*  state means node has instantiatedthe des */ 24: 25: #define IDENTITY_NODE_DES_LASTSET   0x10 26: /* state means last set of id values has been hit*/

The data structure comprises an array which includes the following threeentries: the node current value (idt_node_curr), the node maximum value(idt_node_max), and the state of values of the node(idt_node_val_state). The size of the array is equal to number of nodessupported in the cluster. Also, a cluster-wide OCM lock is defined toprotect this data structure. The above information is maintained atevery node. Thus, every node has the information for itself and allother nodes. Although not included in the above structure, the clustercurrent value is also maintained. In the currently preferred embodiment,the cluster max (cluster_max) value is maintained in another datastructure of the DES.

Identity Allocation Routine for Cluster Environment

In the currently preferred embodiment of the present invention, the mainidentity allocation routine (DES_NODE_NEWIDT_VALUES) of the identitymanager on a node in the cluster is called to allocate identity valuesin one of the following circumstances:

1. Node value state=UNALLOCATED, which means that a block of values needto be allocated to the node.

2. Node current value=node max value, which indicates that no identityvalues remain available at the node and an additional block ofidentifiers needs to be allocated.

3. The identity grab size is set, but (node current value+identity grabsize)<node max value.

When a new block of identifiers is to be allocated to a node, the nodetakes a cluster wide OCM lock on the identity structure to ensure thatno other node in the cluster can allocate identifiers until the currentnode finishes and releases the lock. This is illustrated as follows:

1:  /* 2:  ** Take exclusive right to update the identity structure. 3: ** Obtain ocm lock in exclusive mode. 4:  */ 5: ocm_lock(&des->d_ocb_identity, EX_OBJ); 6: 7:  /* Hold the cluster-widelock in this task */ 8:  ocm_hold(&des->d_ocb_identity);

The identity data structure provides access to the latest values for thecluster current value (cluster_curr), the cluster maximum value(cluster_max) and a node array of values. Next, a check is made todetermine if there are FREE identifiers available as follows:

 1:  /*  2:  ** First check if we have FREE entries.  3:  */  4:  if (Resource->rconfig->idt_grab_size > 1)  5:  {  6:   /* Take grab sizefrom FREE entries */  7:  }  8:  else  9:  { 10:   /* 11:   ** Firstcheck for this node entry, then 12:   ** search other node entries. 13:  */ 14:   if(des->didt.idt_clusternodes_info[nodeid].idt_node_val_state 15:     &IDENTITY_NODE_VAL_FREE) 16:   { 17:    /* FREE state should always beaccompanied by UNALLOCATED state 18:  */ 19: 20: SYB_ASSERT(des->didt.idt_clusternodes_info[nodeid].idt_node_val_state21:  & 22:    IDENTITY_NODE_VAL_UNALLOCATED); 23:  des->didt.idt_clusternodes_info[nodeid].idt_node_val_state &= 24:   ~(IDENTITY_NODE_VAL_UNALLOCATED); 25:  des->didt.idt_clusternodes_info[nodeid].idt_node_val_state &= 26:   ~(IDENTITY_NODE_VAL_FREE); 27:  des->didt.idt_clusternodes_info[nodeid].idt_node_val_state |= 28:   (IDENTITY_NODE_VAL_ALLOCATED); 29: 30:   free_node_curr.value = 31:des->didt.idt_clusternodes_info[nodeid].idt_node_curr; 32:   stat =com_(——)exctnume_inc(&free_node_curr); 33:   if (stat == O_OVERFLOW) 34:  { 35:    ocm_unhold(&des->d_ocb_identity); 36:   ocm_unlock(&des->d_ocb_identity); 37:    return(IDT_OVERFLOW); 38:  } 39: 40:  MEMMOVE(des->didt.idt_clusternodes_info[nodeid].idt_node_curr, 41:   num_rtrn->value, NUMERIC_MAXSIZE); 42: 43:   /* 44:   ** Updating thearray entries for itself and 45:   ** cluster_current. 46:   ** updateother nodes about the change. 47:   */ 48:   des_identity_notify(des,des->ddbid, 49:    des->dobjectc.objostat.objid, 50:   DES_IDT_PCM_PUSH, nodeid); 51: 52:  ocm_unhold(&des->d_ocb_identity); 53:  ocm_unlock(&des->d_ocb_identity); 54:   return IDT_SUCCESS; 55: 56:  }

As previously discussed, FREE identifiers may be available as a resultof another node leaving the cluster or surrendering its values. If ablock of FREE identifiers is found, they are allocated to the local nodeas follows:

 1:  2:  for (i = 0; i < CLUSTER_IDENTITY_NODES; i++)  3:  {  4:   /* IfFREE entry is found, allocate equal_share_value ids. */  5:   if (i ==nodeid) continue;  6:  7:   if(des->didt.idt_clusternodes_info[i].idt_node_val_state  8:    &IDENTITY_NODE_VAL_FREE)  9:   { 10:     /* FREE state should always beaccompanied by        UNALLOCATED 11: state */ 12: 13: 14:   /* Out ofall values allocate only max equal_share_value ids 15: */ 16: 17:  free_node_curr.value = 18:des->didt.idt_clusternodes_info[i].idt_node_curr; 19:  free_node_max.value = 20:des->didt.idt_clusternodes_info[i].idt_node_max; 21:   /* temp_nume =(free_node_curr + equal_share_value */ 22:   stat =com_(——)exctnume_addorsub(&free_node_curr, 23: &equal_share_value, 24:            &temp_nume, 25: 0); 26:   cmp_res =com_(——)exctnume_cmp(&temp_nume,      &free_node_max, 27:NUMECMP_DEFAULT); 28: 29:  MEMMOVE(des->didt.idt_clusternodes_info[i].idt_node_curr, 30:   des->didt.idt_clusternodes_info[nodeid].idt_node_curr, 31:   NUMERIC_MAXSIZE); 32:   if (cmp_res < 1) 33:   { 34:    /* 35:    **(free_node_curr + equal_share_value) <=       free_node_max 36:   **nodeid_curr = free_node_curr + equal_share_value; 37:   **free_node_curr = free_node_curr + equal_share_value; 38:   */ 39:   }40:   MEMMOVE(temp_nume.value, 41:   des->didt.idt_clusternodes_info[nodeid].idt_node_max, 42:   NUMERIC_MAXSIZE); 43:   MEMMOVE(temp_nume.value, 44:   des->didt.idt_clusternodes_info[i].idt_node_curr, 45:   NUMERIC_MAXSIZE); 46:   /* 47:   ** if (free_node_curr +equal_share_value) =     free_node_max ) 48:   ** there are no free idvalues left for node i anymore. 49:   */ 50:   if (cmp_res == 0) 51:   {52:    des->didt.idt_clusternodes_info[i].idt_node_val_state &= 53: 54:~(IDENTITY_NODE_VAL_FREE); 55: 56:MEMZERO(des->didt.idt_clusternodes_info[i].idt_node_curr, 57: 58:NUMERIC_MAXSIZE); 59: 60:MEMZERO(des->didt.idt_clusternodes_info[i].idt_node_max, 61: 62:NUMERIC_MAXSIZE); 63:   } 64:  } 65:  else 66:  { 67:   } 68:  MEMMOVE(des->didt.idt_clusternodes_info[i].idt_node_max, 69:   des->didt.idt_clusternodes_info[nodeid].idt_node_max, 70:   NUMERIC_MAXSIZE); 71:  des->didt.idt_clusternodes_info[i].idt_node_val_state &= 72:   ~(IDENTITY_NODE_VAL_FREE); 73:  MEMZERO(des->didt.idt_clusternodes_info[i].idt_node_curr, 74:    NUMERIC_MAXSIZE); 75:  MEMZERO(des->didt.idt_clusternodes_info[i].idt_node_max, 76:    NUMERIC_MAXSIZE); 77: 78:  } 79: 80: des->didt.idt_clusternodes_info[nodeid].idt_node_val_state &= 81:  ~(IDENTITY_NODE_VAL_UNALLOCATED); 82: des->didt.idt_clusternodes_info[nodeid].idt_node_val_state |= 83:  (IDENTITY_NODE_VAL_ALLOCATED); 84:  free_node_curr.value = 85:des->didt.idt_clusternodes_info[nodeid].idt_node_curr; 86:  stat =com_(——)exctnume_inc(&free_node_curr); 87:  if (stat == O_OVERFLOW) 88: { 89:   ocm_unhold(&des->d_ocb_identity); 90:  ocm_unlock(&des->d_ocb_identity); 91:   return(IDT_OVERFLOW); 92:  }93:  MEMMOVE(des->     didt.idt_clusternodes_info[nodeid].idt_node_curr,94:   num_rtrn->value, NUMERIC_MAXSIZE); 95:  } 96: 97:  /* 98:  **Changed the array entries for local node and node i. 99:  ** Push thischange to others. 100:  */ 101:  des_identity_notify(des, des->ddbid,102:   des->dobjectc.objostat.objid, 103:   DES_IDT_PCM_PUSH, nodeid);104: 105:  des_identity_notify(des, des->ddbid, 106:  des->dobjectc.objostat.objid, 107:   DES_IDT_PCM_PUSH, i); 108: 109: ocm_unhold(&des->d_ocb_identity); 110: ocm_unlock(&des->d_ocb_identity); 111:     return(IDT_SUCCESS); 112:   } /* if */ 113:   } /* for */ 114:  } /* if then else */

If FREE entries are found and allocated to the local node, entriesreflecting this allocation are made in the identity data structure, andthe updates are pushed to the other nodes as shown above commencing atline 101. When this is complete, the lock on the identity structure isreleased.

In other cases, however, there may be no FREE entries available. If noFREE entries are found, then a block of identifiers is allocated fromthe cluster level (if available at the cluster level). At this point acheck is made to determine if there are identifiers available at thecluster level that can be allocated to the node as follows:

 1:  /*  2:  ** Come here when there are no FREE entries in the cluster, 3:  ** and hence, need to allocate from cluster_current to cluster_max. 4:  */  5:  6:   cluster_curr.value = des->didt.idt_cluster_curr;  7: 8:  /* cluster_max is the objidentburnmax value */  9: cluster_max.value = des->dobjectc.objidentburnmax; 10: 11:  /* comparethe cluster_max and cluster_curr values */ 12:  cmp_res =com_(——)exctnume_cmp(&cluster_curr, &cluster_max, 13: NUMECMP_DEFAULT);14: 15:  /* 16:  ** First check if cluster_curr > cluster_max 17:  **Raise an error 18:  */ 19: 20:  if (cmp_res == 1) 21:  { 22: 23:  } 24: else 25:  if (cmp_res == −1) 26:  { 27:   /* 28:   ** cluster_curr <cluster_max, so check if we can 29:   ** grant this request 30:   */ 31:32:   if ( Resource->rconfig->idt_grab_size > 1) 33:   { 34:    /* 35:   ** identity_grab_size parameter is set 36:    /* 37: 38:    /* 39:   ** check if (cluster_curr + grab_size <= cluster_max) 40:    */ 41:42:    /* taking out the identity_grab_size_value */ 43:   idt_get_grab_size(&equal_share_value, &des->didt); 44: 45:    /*adding cluster_curr and identity_grab_size =       temp_nume */ 46:   stat = com_(——)exctnume_addorsub(&cluster_curr,      &equal_share_value, 47:        &temp_nume, 48: 0); 49: 50: 51:   cmp_res =       com_(——)exctnume_cmp( &cluster_max, &temp_nume, 52:NUMECMP_DEFAULT); 53:    if (cmp_res > −1) 54:    { 55:     /* 56:    ** (cluster_curr + grab_size)<= cluster_max 57:     ** Followingsteps will be done 58:     ** 1. pss->pidtvalinfo.idtstart =cluster_current + 1; 59:     ** 2. cluster_current = cluster_current +grab_size; 60:     ** set other fields in pss->pidtvalinfo 61:     */62: 63:   /* 64:   ** Nothing to be done here. Setting of the values 65:  ** done outside the loops. 66:   */ 67:  }

As previously discussed, if the cluster current value (cluster_curr)plus the grab size is less than cluster_max, then identifiers areavailable at the cluster level for allocation to the local node. If thisis the case, identity values are allocated to the node and entries inthe identity data structure are updated. (Note that these steps are doneoutside of the above routine). However, if there are no identifiersavailable at the cluster level (e.g., if(cluster_curr+grab_size)>cluster_max), additional identity values needto be obtained from disk. In this case, an allocation is obtained fromdisk for use in the cluster and a block of identifiers are allocated tothe node. Cluster_curr, cluster_max and other entries in the identitydata structure are also updated to reflect the allocation. As previouslydiscussed, special handling is also provided when the last set ofidentity values is hit. Once the updates to the identity structure havebeen made, the updated values are sent to the other nodes so that allnodes are in sync. When these steps have been completed, the clusterwide OCM lock on the identity data structure is released.

Detection of Last Node DES Scavenge/Release

As previously mentioned, when a DES Scavenge is performed at aparticular node, a check is performed to see if the node is the lastnode in the cluster that has the DES instantiated. If that is the case(i.e., true), the last node needs to write out the FREE values and valueof cluster_current to sysattributes system table. An efficient manner ofdetermining this is required. (Sending a message to all nodes askingwhether they have the DES instantiated is inefficient and is notadopted.)

A new state in the ID values state is introduced:IDENTITY_NODE_DES_INST. If this bit (state) is set for any particularnode entry in the array, then the particular node has instantiated theDES. Since the updated changes are pushed at all times, every node isguaranteed to have the correct information about every other nodewhether the DES has been instantiated at the other nodes or not. Whenthe DES is instantiated at any node, the bit is set and the update ispushed to all other nodes.

The following steps are executed at DES Scavenge time

a) Take cluster wide OCM EX lock.

b) Reset current node IDENTITY_NODE_DES_INST bit. Need to let othersknow that des has been scavenged on this node.

c) For all nodes in the cluster other than current node, check ifIDENTITY_NODE_DES_INST bit is set for any node. If the bit is set, thereis one more node which has DES instantiated so current node is not lastnode. So, push the updated values to other nodes. If bit is not set, itmeans current node is the last node. Do other processing (sysattributeswriting) as previously mentioned.

e) Release lock.

Detection of Lastset Hit Case

When one runs out of identity values, the id values that have beenallocated in the cluster need to be redistributed. An efficient way todetermine whether to do lastset allocation or regular allocation isrequired. Checking for cluster_current and cluster_max every time willnot enable this determination. In order to do, therefore, a special bitis introduced: IDENTITY_NODE_DES_LASTSET. This bit is set to indicatethat the last set has been hit and one needs to allocate id valuesdifferently.

When DES is instantiated at every node, the system resets this bit. Whenthe lastset is hit, the first node that detects that set its bit andsend a message to all other nodes to surrender their values (asexplained earlier). As part of the callback function handling, thenotified nodes (as part of the id values surrender process) set thecorresponding bit in their array value. Thus, at every node, withoutchecking for cluster_current and cluster_max the system is able toefficiently determine whether it has hit the lastset of values (and thusdo special handling for that case).

While the invention is described in some detail with specific referenceto a single-preferred embodiment and certain alternatives, there is nointent to limit the invention to that particular embodiment or thosespecific alternatives. For instance, those skilled in the art willappreciate that modifications may be made to the preferred embodimentwithout departing from the teachings of the present invention.

1. In a distributed system having a plurality of nodes, a method forallocating identifiers for use at nodes of the distributed system, themethod comprising: allocating a pool of identifiers for use in thesystem; maintaining a list of free identifiers in the pool at allparticipating nodes in the system; obtaining at a first node permissionto update the list of free identifiers; upon receiving permission toupdate the list, the first node allocating for itself a set ofidentifiers from the list of free identifiers; updating the list of freeidentifiers to reflect allocation of the set of identifiers for thefirst node; sending the updated list of free identifiers from the firstnode to other participating nodes; upon receiving the updated list offree identifiers at each other participating node, updating each otherparticipating node's respective copy of the list of free identifiers;and relinquishing the first node's permission to update the list of freeidentifiers.
 2. The method of claim 1, wherein said distributed systemcomprises a distributed database system including a plurality ofdatabase server nodes sharing access to a database.
 3. The method ofclaim 2, wherein said identifiers comprise row identifiers for uniquelyidentifying a particular row of a table of the database.
 4. The methodof claim 1, wherein said maintaining step includes maintaining the listof free identifiers in memory at each of said participating nodes. 5.The method of claim 1, wherein said obtaining step includes obtaining anexclusive lock on said list of free identifiers at each of saidparticipating nodes.
 6. The method of claim 5, wherein saidrelinquishing step includes releasing said exclusive lock.
 7. The methodof claim 1, further comprising: upon the first node leaving thedistributed system, sending a list of unused identifiers allocated tothe first node to other participating nodes, so as to enable otherparticipating nodes to use said unused identifiers.
 8. The method ofclaim 7, further comprising: allocating a set of identifiers from saidlist of unused identifiers to a given participating node when additionalidentifiers are needed at the given node.
 9. The method of claim 1,further comprising: upon the first node leaving the distributed system,determining if other nodes continue to participate in the distributedsystem; if other nodes continue to participate in the distributedsystem, sending a list of unused identifiers allocated to the first nodeto other participating nodes, so as to enable other participating nodesto use said unused identifiers; and otherwise, writing the list ofunused identifiers to disk, so as to enable future use of such unusedidentifiers.
 10. The method of claim 1, wherein said step of allocatinga pool of identifiers includes allocating a block of identifiers fromdisk and recording information on disk about allocation of the block ofidentifiers.
 11. The method of claim 10, wherein said step of the firstnode allocating for itself a set of identifiers includes allocating aset of identifiers from the pool without writing information about theallocation to disk.
 12. The method of claim 1, further comprising:allocating one or more identifiers to a task running at the first nodefrom the set of identifiers allocated to the first node when the taskneeds identifiers.
 13. The method of claim 12, wherein said stepallocating one or more identifiers to the task includes maintaininginformation about identifiers allocated to the task at the first node,without updating other participating nodes about identifiers allocatedto the task.
 14. The method of claim 13, further comprising: uponmigration of the task from the first node to a second node of thedistributed system, sending information about identifiers allocated tothe task to the second node, so that the task may use such identifiersafter migration to the second node.
 15. A computer-readable mediumhaving processor-executable instructions for performing the method ofclaim
 1. 16. A system for allocating identifiers for use at a pluralityof database server nodes sharing access to a database, the systemcomprising: a database; a plurality of database server nodes connectedto each other via a network and sharing access to the database; anidentity data structure at each database server node for maintaininginformation about identifiers allocated for use among the plurality ofdatabase server nodes; a distributed lock for regulating access to theidentity data structure at the plurality of server nodes; and anidentity manager, responsive to a request for identifiers at a givennode, for acquiring the distributed lock on the identity data structure,allocating a set of identifiers to the given node upon acquiring thedistributed lock, updating the identity data structure at said pluralityof database server nodes to reflect allocation of the set of identifiersfor the given node, and relinquishing the distributed lock aftercompletion of updates to the identity data structure at all databaseserver nodes.
 17. The system of claim 16, wherein said plurality ofdatabase server nodes comprise a cluster of database server nodessharing access to a database persistently stored on disk storage. 18.The system of claim 17, wherein the identity manager, in response to aparticular database server node leaving the cluster, updates theidentity data structure at each database server node with a list ofunused identifiers allocated to the particular database server node, soas to allow use of such unused identifiers.
 19. The system of claim 18,further comprising: allocating a set of identifiers from said list ofunused identifiers to the given node in response to the request foridentifiers at the given node.
 20. The system of claim 17, wherein theidentity manager, in response to a particular database server nodeleaving the cluster, writes a list of unused identifiers allocated tothe particular node to disk if no other nodes remain in the cluster. 21.The system of claim 16, wherein said identifiers comprise uniqueidentifiers for identifying particular rows of a table of the database.22. The system of claim 16, wherein the identity data structuremaintains the information about identifiers in memory at each of saidplurality of database server nodes.
 23. The system of claim 16, whereinthe distributed lock comprises an exclusive lock which, when invoked bythe identity manager at a given database server node, blocks access tothe identity data structure by other database server nodes.
 24. Thesystem of claim 16, wherein the identity manager tracks quantities ofidentifiers allocated for use among the plurality of database servernodes and obtains an additional pool of identifiers from the databasewhen necessary.
 25. The system of claim 24, wherein said identitymanager at a given node allocates a set of identifiers from a poolpreviously obtained from the database for use at the given node inresponse to the request from the given node.
 26. The system of claim 25,wherein said step of allocating for the given node a set of identifiersincludes allocating a set of identifiers from the pool without writinginformation about the allocation to disk.
 27. The system of claim 24,wherein the identity manager divides the pool of identifiers obtainedfrom the database into smaller blocks for allocation of said smallerblocks amongst the plurality of database server nodes.
 28. The system ofclaim 27, wherein said smaller blocks are allocated to a given databaseserver node in response to demand for identifiers at the given databaseserver node.
 29. The system of claim 27, wherein size of said smallerblock is determined based on number of database server nodes sharingaccess to the database.
 30. The system of claim 16, wherein the identitymanager allocates one or more identifiers to a particular task runningat the given node from a set of identifiers previously allocated to thegiven node when the particular task needs identifiers.
 31. The system ofclaim 30, wherein said identity manager maintains information aboutidentifiers allocated to the task at the given node, without updatingother participating nodes about identifiers allocated to the task. 32.The system of claim 31, wherein said identity manager sends informationabout identifiers allocated to the task at the given node to a secondnode upon the task migrating to the second node, so that the task mayuse the allocated identifiers after migration to the second node.
 33. Ina distributed database system having a plurality of nodes sharing accessto a database, a method for allocating identifiers for use at nodes ofthe distributed system, the method comprising: allocating a pool ofidentifiers from the database for use among the plurality of nodes ofthe distributed database system; maintaining information about the poolof identifiers at each of the nodes in an identity data structure; inresponse to a request for identifiers at a given node, providingidentifiers from a set of identifiers previously allocated for use atthe given node while identifiers remain in the set allocated for thegiven node; when no identifiers remain in the set, obtaining additionalidentifiers for use at the given node by performing substeps of:acquiring a lock on the identity data structure at each of the nodes; atthe given node, allocating identifiers available in the pool for use atthe given node; updating the identity data structure to reflectallocation of the identifiers for the given node; and upon updating theidentity data structure at all nodes, relinquishing the lock.
 34. Themethod of claim 33, wherein said identifiers comprise row identifiersfor uniquely identifying a particular row of a table of the database.35. The method of claim 33, wherein said maintaining step includesprotecting the identity data structure with a distributed lock.
 36. Themethod of claim 33, wherein said maintaining step includes maintainingthe information about the pool in memory at each of the nodes.
 37. Themethod of claim 33, further comprising: upon a first node leaving thedistributed database system, sending a list of unused identifiersallocated to the first node to at least one other node, so as to enableuse of said unused identifiers at said at least one other node.
 38. Themethod of claim 33, further comprising: upon a first node leaving thedistributed database system, determining if at least one other nodecontinues to participate in the distributed database system; if at leastone other node continues to participate, sending a list of unusedidentifiers allocated to the first node to said at least one othernodes; and otherwise, writing the list of unused identifiers to disk.39. The method of claim 33, wherein said step of allocating a pool ofidentifiers includes allocating a block of identifiers from disk andrecording information on disk about allocation of the block ofidentifiers.
 40. The method of claim 33, wherein said substep ofallocating identifiers available in the pool includes allocating suchidentifiers without writing information about the allocation to disk.41. The method of claim 33, further comprising: allocating one or moreidentifiers to a task running at the given node from identifiersallocated to the given node when the task needs identifiers.
 42. Themethod of claim 41, wherein said step of allocating one or moreidentifiers to a task includes allocating such identifiers withoutwriting information about the allocation to disk.
 43. The method ofclaim 41, wherein said step of allocating one or more identifiers to atask includes maintaining information about identifiers allocated to thetask at the given node, without updating other nodes of the distributeddatabase system about identifiers allocated to the task.
 44. The methodof claim 41, further comprising: upon migration of the task from thegiven node to a different node of the distributed system, sendinginformation about identifiers allocated to the task to the differentnode, so that the task may use such identifiers after migration.
 45. Themethod of claim 41, further comprising: when no identifiers remain inthe pool, obtaining an additional allocation of identifiers from thedatabase for use among the plurality of nodes of the distributeddatabase system.
 46. The method of claim 45, further comprising:dividing the additional allocation of identifiers obtained from thedatabase into a plurality of blocks for allocation amongst the pluralityof nodes.
 47. The method of claim 46, further comprising: allocating oneof the blocks to a given node in response to demand for identifiers atthe given node.
 48. The method of claim 46, wherein said dividing stepincludes dividing the allocation of identifiers obtained from thedatabase based on number of nodes of the distributed database system.49. A computer-readable medium having processor-executable instructionsfor performing the method of claim 33.